With the worst of the SARS-CoV-2 pandemic finally subsiding, the task of understanding the epidemiological factors contributing to the propagation of COVID-19 has only begun. Understanding the relationships between COVID-19 spread and socioeconomic variables will prove vital in informing how we mitigate the propagation of the next pandemic. This report aims to understand how the economic, education, behavioral, and population data for the 3141 U.S. counties relate to COVID-19 case data.
We acknowledge that the impact of and response to COVID-19 have been very different from county to county. Looking at the current COVID-19 vaccination data in mid-May 2021, we note that the vaccination rate for ages 18+ ranges from 11% in some counties in Louisiana to 74% in some counties in New York. Our guiding question is: Who is the most vulnerable to COVID-19 infection and death? This knowledge will guide public health efforts as we continue to fight against the spread of COVID-19. Knowledge of what socioeconomic factors put people at risk will allow us to prioritize our vaccination and education efforts for those who need them the most and will also let us take a step back to acknowledge the systemic health inequalities in our country.
This report deploys three multivariate techniques to examine three questions in particular:
How do the 3141 counties differ from one another, i.e., how do the socioeconomic and COVID-19 data relate to one another when distinguishing U.S. counties? Principal components analysis (PCA) will help to reduce the dimensionality of our large dataset, increasing interpretability of underlying trends between clusters of variables. This metric technique works on the “columns” of our data set to reduce them into composite variables and make them more interpretable.
Which U.S. counties are similar to one another? Cluster analysis will enable the clustering of selected counties into a discrete number of groups based on similar socioeconomic and COVID-19 data. This metric technique works on the “rows” of our data set to find similar groups of observations.
Which U.S. county variable pairings are similar to one another? Correspondence analysis is similar to PCA but applies to categorical rather than continuous variables. This nonmetric technique works on both the “columns” and “rows” of our data set to visualize which row and column points are similar in lower-dimensional space.
Using these techniques, we will be able to better understand our variables, our observations, and the interactions between our variables and observations. Answering our guiding question (who is most vulnerable to COVID-19 infection and death?) will allow us to direct resources toward protecting these vulnerable populations.
The dataset referenced in this report includes COVID-19 infection and death statistics from U.S. counties (sourced from Johns Hopkins, as of 28 April 2021), combined with economic, education, and population data (sourced from various government agencies) and also survey responses about mask-wearing frequencies (sourced from NYT).
The dataset contains 3141 complete observations on 9 continuous variables and 6 categorical variables. Continuous variables were rescaled as percentages of county population.
6 categorical variables: FIPS, county name, state name, rural urban type, rural urban code, economic typology
9 continuous variables: “Always” wear mask survey response percent, unemployment rate, median household income, percent poverty, percent of adults with less than a high school education, death rate, percent civilian labor force, percent of county population that has had confirmed COVID-19 cases, and percent of county population that has died from COVID-19.
[1] “FIPS” = State-County FIPS Code; Categorical (identifier)
[2] “County_Name” = US County Name; Categorical (identifier)
[3] “State_Name” = US State Name; Categorical
[4] “Rural_Urban_Type” = Regrouping of Rural-Urban Codes (2013) numbered 1-9 according to descriptions provided by the USDA. See variable [5]. Regroup codes 1 through 9 into three groups: (1) “Urban” for codes 1-3, (2) “Suburban” for codes 4-6, and (3) “Rural” for codes 7-9; Categorical (1-3)
[5] “Rural_Urban_Code_2013” = Rural-urban Continuum Code, 2013; (https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/); Categorical (1-9)
[6] “Economic_Typology_2015” = County economic types, 2015 edition (https://www.ers.usda.gov/data-products/county-typology-codes/); Non-overlapping economic-dependence county indicator. 0=Nonspecialized 1=Farm-dependent 2=Mining-dependent 3=Manufacturing-dependent 4=Federal/State government-dependent 5=Recreation; Categorical (0-5)
[7] “Always_Wear_Mask_Survey” = “Always” response. The New York Times administered a survey to 250,000 Americans from July 2 to July 14 asking the following question: How often do you wear a mask in public when you expect to be within six feet of another person?; Continuous (%)
[8] “Unemployment_Rate_2019” = Unemployment rate, 2019; Continuous (%)
[9] “Median_Household_Income_2019” = Estimate of median household income, 2019; Continuous ($)
[10] “Percent_Poverty_2019” = Estimate of people of all ages in poverty 2019; Continuous (%)
[11] “Percent_Adults_Less_Than_HS” = Percent of adults with less than a high school diploma, 2014-18; Continuous (%)
[12] “Death_Rate_2019” = Death rate in period 7/1/2018 to 6/30/2019; Continuous (%)
[13] “Civilian_Labor_Force_2019_as_pct” = Civilian labor force annual average, 2019, expressed as percent; Continuous (%)
[14] “Covid_Confirmed_Cases_as_pct” = Cumulative sum of COVID-19 cases expressed as percent. Reported from Johns Hopkins on 28 April 2021; Continuous (%)
[15] “Covid_Deaths_as_pct” = Cumulative sum of COVID-19 deaths expressed as percent. Reported from Johns Hopkins on 28 April 2021; Continuous (%)
We made normal quantile plots for each of the 9 continuous variables in the dataset. These revealed that most variables initially did not have a univariate normal distribution. Log-transforming the 9 continuous variables made most of the quantile plots more linear. Note that we also standardized the continuous variables since they were measured on different scales. Moreover, for death rate, percent COVID-19 cases, and percent COVID-19 deaths, a 1.5 x IQR outlier exclusion method was applied so that these variables would take on more normal univariate distributions. The outlier exclusion reduced the number of counties that we analyze from 3141 to 2,814 observations, a reduction of approximately 10%. This is somewhat high; however, we deemed the benefit of more nearly normal distributions to outweigh this disadvantage. With these changes made, the 9 continuous variables all had approximately normal univariate distributions.
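The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the report: the function names and the eight death-rate values are hypothetical, and the order of operations (outlier exclusion first, then log transform and standardization) is an assumption, since the report does not state it.

```python
import math
import statistics

def iqr_keep_mask(values, k=1.5):
    """Return True for each value inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [lo <= v <= hi for v in values]

def log_standardize(values):
    """Log-transform positive values, then rescale to mean 0 and SD 1."""
    logged = [math.log(v) for v in values]
    mean = statistics.fmean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation
    return [(x - mean) / sd for x in logged]

# Hypothetical death rates (%) for eight counties; the last one is extreme.
death_rate = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 9.0]

keep = iqr_keep_mask(death_rate)                      # flags the extreme county
kept = [v for v, ok in zip(death_rate, keep) if ok]   # drop flagged counties
z = log_standardize(kept)                             # mean 0, SD 1 on the log scale
```

In a full analysis a county would be dropped if it were flagged on any of the three variables, which is how the exclusion can remove about 10% of rows overall.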
A chi-square quantile plot (shown above) indicates that our data do not have a multivariate normal distribution. Thus, we rely only on techniques that do not require multivariate normality.
We note that many variables are highly correlated with other variables, which is good news for PCA. For instance, the correlation between the log of the unemployment rate and the labor force as a percent is -0.065, the correlation between the log of the median household income and percent poverty is -0.88, and the correlation between the percent of COVID-19 cases and the percent of COVID-19 deaths is 0.47. There appear to be underlying trends about the counties (about beliefs about COVID-19, about wealth/education, etc.) that could be summarized in linear combinations of the 9 metric variables we have currently.
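For readers who want to see how a pairwise correlation of this kind is obtained, here is a small sketch of the Pearson correlation coefficient. The five county values are made up for illustration and are not taken from the dataset; the function name `pearson` is likewise hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for five counties (not from the report's data).
log_income = [10.6, 10.8, 11.0, 11.2, 11.4]
log_poverty = [3.1, 2.9, 2.8, 2.5, 2.3]

r = pearson(log_income, log_poverty)  # strongly negative for these values
```

Strong correlations like this one are what allow PCA to fold several variables into a single component.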
## [1] 2814 15
##
## Rural Suburban Urban
## 907 841 1066
##
## 0 1 2 3 4 5
## 1146 392 196 480 351 249
## Economic_Typology_2015 Always_Wear_Mask_Survey_Log Unemployment_Rate_2019_Log
## Min. :0.000 Min. :-4.38292 Min. :-3.06445
## 1st Qu.:0.000 1st Qu.:-0.65971 1st Qu.:-0.61644
## Median :1.000 Median : 0.04834 Median :-0.07051
## Mean :1.732 Mean :-0.02779 Mean : 0.01558
## 3rd Qu.:3.000 3rd Qu.: 0.69544 3rd Qu.: 0.60164
## Max. :5.000 Max. : 1.90553 Max. : 5.03446
## Median_Household_Income_2019_Log Percent_Poverty_2019_Log
## Min. :-3.23567 Min. :-3.06671
## 1st Qu.:-0.67529 1st Qu.:-0.59390
## Median :-0.09013 Median : 0.04824
## Mean :-0.05574 Mean : 0.04623
## 3rd Qu.: 0.51653 3rd Qu.: 0.70815
## Max. : 3.79491 Max. : 3.22673
## Percent_Adults_Less_Than_HS_Log Death_Rate_2019_Log
## Min. :-4.18428 Min. :-1.4689
## 1st Qu.:-0.57692 1st Qu.:-0.2168
## Median : 0.09978 Median : 0.1861
## Mean : 0.07065 Mean : 0.1237
## 3rd Qu.: 0.78307 3rd Qu.: 0.5328
## Max. : 2.90401 Max. : 1.6538
## Civilian_Labor_Force_2019_as_pct_Log Covid_Confirmed_Cases_as_pct_Log
## Min. :-5.00174 Min. :-0.82906
## 1st Qu.:-0.57127 1st Qu.:-0.05023
## Median : 0.06175 Median : 0.17048
## Mean :-0.03923 Mean : 0.15011
## 3rd Qu.: 0.62656 3rd Qu.: 0.37812
## Max. : 5.25580 Max. : 1.08067
## Covid_Deaths_as_pct_Log
## Min. :-1.7523
## 1st Qu.:-0.2623
## Median : 0.2318
## Mean : 0.1925
## 3rd Qu.: 0.6637
## Max. : 2.0837
We can see that there are data from 2814 of the 3141 counties in the US, with a fairly even distribution of counties across rural-urban types: 907 rural, 841 suburban, and 1066 urban. The distributions of the quantitative variables are fairly consistent, which we also saw in the box plot above, meaning that standardization might not make a huge difference. Let’s go ahead and make a standardized version of the data set anyway, though.
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.9755479 1.2823031 1.0139924 0.87008610 0.77310275
## Proportion of Variance 0.4336433 0.1827002 0.1142423 0.08411665 0.06640976
## Cumulative Proportion 0.4336433 0.6163434 0.7305857 0.81470236 0.88111212
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.61072705 0.58241257 0.52776316 0.281540505
## Proportion of Variance 0.04144306 0.03768938 0.03094822 0.008807228
## Cumulative Proportion 0.92255518 0.96024455 0.99119277 1.000000000
## Warning in if (loadings) {: the condition has length > 1 and only the first
## element will be used
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## Always_Wear_Mask_Survey_Log 0.01 0.52 0.57 0.45 0.11 0.25
## Unemployment_Rate_2019_Log -0.34 0.32 -0.01 0.02 -0.70 -0.07
## Median_Household_Income_2019_Log 0.45 0.07 0.21 0.14 -0.21 0.03
## Percent_Poverty_2019_Log -0.46 0.04 0.01 -0.22 0.22 -0.07
## Percent_Adults_Less_Than_HS_Log -0.39 0.04 0.32 -0.20 0.44 0.23
## Death_Rate_2019_Log -0.29 -0.11 -0.51 0.62 0.03 0.47
## Civilian_Labor_Force_2019_as_pct_Log 0.41 -0.20 -0.04 0.13 0.27 0.14
## Covid_Confirmed_Cases_as_pct_Log -0.09 -0.60 0.40 -0.16 -0.38 0.52
## Covid_Deaths_as_pct_Log -0.23 -0.46 0.31 0.52 0.02 -0.60
## Comp.7 Comp.8 Comp.9
## Always_Wear_Mask_Survey_Log 0.25 0.21 0.12
## Unemployment_Rate_2019_Log 0.15 -0.51 0.00
## Median_Household_Income_2019_Log -0.41 -0.10 -0.71
## Percent_Poverty_2019_Log 0.45 0.16 -0.68
## Percent_Adults_Less_Than_HS_Log -0.46 -0.50 0.02
## Death_Rate_2019_Log -0.14 0.01 -0.14
## Civilian_Labor_Force_2019_as_pct_Log 0.54 -0.62 -0.05
## Covid_Confirmed_Cases_as_pct_Log 0.12 0.14 0.02
## Covid_Deaths_as_pct_Log -0.04 -0.08 0.00
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## 3.90 1.64 1.03 0.76 0.60 0.37 0.34 0.28 0.08
According to the total variance explained method, the first 3 PCs should be used, since together they explain 73% of the total variance. According to the eigenvalue > 1 method, the first 3 PCs should also be used. According to the scree plot elbow method, only the first PC should be used. We choose to retain the first 3 PCs in accordance with the first two methods for a parsimonious but still informative model.
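The retention rules can be checked directly from the component standard deviations reported in the output. The sketch below squares them to obtain eigenvalues and applies two of the rules; the 70% variance threshold is an assumed value chosen for illustration, not one stated in the report.

```python
# Component standard deviations copied from the "Importance of components" output.
sdev = [1.9755479, 1.2823031, 1.0139924, 0.87008610, 0.77310275,
        0.61072705, 0.58241257, 0.52776316, 0.281540505]

# Eigenvalues are the squared standard deviations; they sum to roughly 9
# because the 9 variables were standardized.
eigenvalues = [s ** 2 for s in sdev]
total = sum(eigenvalues)
proportion = [ev / total for ev in eigenvalues]

# Running total of the proportion of variance explained.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

# Eigenvalue > 1 rule: keep components that explain more than one variable's worth.
n_kaiser = sum(ev > 1 for ev in eigenvalues)

# Variance-explained rule with an ASSUMED 70% threshold (not from the report).
threshold = 0.70
n_variance = next(i + 1 for i, c in enumerate(cumulative) if c >= threshold)
```

Both rules select 3 components here; the third eigenvalue (about 1.03) only narrowly clears the cutoff of 1, so the choice is somewhat sensitive to that rule.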
Looking at PC1: This principal component seems to be related to wealth, work status, and education. It combines the log of percent poverty (-0.46) with the log of the median household income (0.45), the log of the civilian labor force (0.41), and the log of the percent of adults with less than a high school education (-0.39). A higher value on this PC indicates more wealth, more employment, and more education.
Looking at PC2: This principal component seems to be a measure of masking behaviors and their relation to COVID-19 infection. It combines the log of the percentage of those who say they always mask (0.52), the log of the COVID-19 infection rate (-0.60), and the log of the COVID-19 death rate (-0.46). A higher value on this PC indicates more masking and less COVID-19 infection and death.
Looking at PC3: This principal component seems to be a combination of the first two and relates underlying traits about the county to COVID-19 infection. It combines the log of the 2019 death rate (-0.51) with the log of the percentage of adults with less than a high school education (0.32), the log of the COVID-19 infection rate (0.40), and the log of the COVID-19 death rate (0.31). The log of the always-mask response also loads strongly on this component (0.57). A higher value on this PC indicates less education and more COVID-19 infection and death.
Using PCA, we can reduce these 9 metric variables to 3 composite variables that are related to wealth and education, attitudes about masking, and population. These 3 PCs account for 73% of the total variability, which is fairly effective! We note that COVID-19 infection and death rates are related to attitudes and behaviors, like the masking rate, but are also related to factors outside of the control of a county’s population, like unemployment and education.
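To make the idea of a composite variable concrete, the sketch below computes a PC1 score as the loading-weighted sum of a county's standardized values. The loadings are copied from the loadings table (rounded to two decimals); the county profile and the helper `pc_score` are hypothetical, and the sketch assumes the inputs are already centered and scaled.

```python
# PC1 loadings copied from the loadings table (rounded to two decimals).
pc1_loadings = {
    "Always_Wear_Mask_Survey_Log": 0.01,
    "Unemployment_Rate_2019_Log": -0.34,
    "Median_Household_Income_2019_Log": 0.45,
    "Percent_Poverty_2019_Log": -0.46,
    "Percent_Adults_Less_Than_HS_Log": -0.39,
    "Death_Rate_2019_Log": -0.29,
    "Civilian_Labor_Force_2019_as_pct_Log": 0.41,
    "Covid_Confirmed_Cases_as_pct_Log": -0.09,
    "Covid_Deaths_as_pct_Log": -0.23,
}

def pc_score(loadings, z_values):
    """Component score: sum of loading * standardized value over all variables."""
    return sum(loadings[name] * z_values[name] for name in loadings)

# Hypothetical county: average on everything except income (+1 SD) and poverty (-1 SD).
affluent = {name: 0.0 for name in pc1_loadings}
affluent["Median_Household_Income_2019_Log"] = 1.0
affluent["Percent_Poverty_2019_Log"] = -1.0

score = pc_score(pc1_loadings, affluent)  # positive: higher income, lower poverty
```

A county with the opposite profile would receive the same score with the sign flipped, which matches the reading of PC1 as a wealth, employment, and education axis.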
There are peaks in the RMSSTD at 3, 9, and most notably 12, indicating that these may be reasonable group counts. SPRSQ tapers at 9, supporting the idea that there may be 9 groups. However, the tapering is far more prominent at 2 and 4 and is echoed in the RSQ, so a lower group count may be indicated. Let’s go back to the dendrogram to see how this fits.
The large spike in RMSSTD is very promising, and the dendrogram clustering looks apt as well, but we don’t want to run the risk of having too many clusters!
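As a rough illustration of one of these diagnostics, RSQ for a given partition is the share of total variation that lies between clusters, i.e. one minus the within-cluster sum of squares divided by the total sum of squares. The sketch below uses four made-up two-dimensional points rather than the county data, and the function names are hypothetical.

```python
def sum_sq(points):
    """Sum of squared distances of the points from their centroid."""
    dims = range(len(points[0]))
    centroid = [sum(p[d] for p in points) / len(points) for d in dims]
    return sum((p[d] - centroid[d]) ** 2 for p in points for d in dims)

def r_squared(points, labels):
    """RSQ = 1 - (within-cluster sum of squares) / (total sum of squares)."""
    total = sum_sq(points)
    within = 0.0
    for group in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == group]
        within += sum_sq(members)
    return 1 - within / total

# Hypothetical points forming two well-separated pairs.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

rsq_two = r_squared(points, [1, 1, 2, 2])  # two clusters: most variation is between groups
rsq_one = r_squared(points, [1, 1, 1, 1])  # one cluster: nothing is explained
```

SPRSQ at a merge step is the drop in RSQ caused by that merge, so a large value signals that two dissimilar groups were joined.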
## Group.1 Always_Wear_Mask_Survey_Log Unemployment_Rate_2019_Log
## 1 1 -0.2507423 -0.7824498
## 2 2 0.2337914 0.1154028
## 3 3 -0.0251096 0.7822435
## Median_Household_Income_2019_Log Percent_Poverty_2019_Log
## 1 0.91330915 -0.95850252
## 2 -0.04707261 0.05347171
## 3 -1.03586352 1.08141832
## Percent_Adults_Less_Than_HS_Log Death_Rate_2019_Log
## 1 -0.81881942 -0.6763568
## 2 0.01449156 0.1654341
## 3 0.96752538 0.5841439
## Civilian_Labor_Force_2019_as_pct_Log Covid_Confirmed_Cases_as_pct_Log
## 1 0.86216565 0.1251604
## 2 0.02127486 -0.3657841
## 3 -1.06993756 0.3615719
## Covid_Deaths_as_pct_Log
## 1 -0.3230976
## 2 -0.1799121
## 3 0.6418958
Recall that we are clustering U.S. counties based on 9 standardized metric variables. We are trying to find groups of counties whose members are similar to each other but different from other groups - we are finding clusters of observations, unlike PCA where we found clusters of variables. For the most parsimonious model, we examine 3 clusters that primarily differ on their wealth/education/employment and COVID-19 infection and death rates.
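A table of cluster means like the one above can be produced by averaging each standardized variable within each cluster. The following sketch shows the idea with four made-up rows and two variables; the values, labels, and the helper `cluster_means` are hypothetical and do not reproduce the report's numbers.

```python
from collections import defaultdict
from statistics import fmean

def cluster_means(rows, labels):
    """Average each column of `rows` within each cluster label."""
    grouped = defaultdict(list)
    for row, lab in zip(rows, labels):
        grouped[lab].append(row)
    # zip(*members) turns a list of rows into a sequence of columns.
    return {lab: [fmean(col) for col in zip(*members)]
            for lab, members in sorted(grouped.items())}

# Hypothetical standardized values: (log income, log poverty) for four counties.
rows = [(1.0, -1.0), (0.8, -0.6), (-1.0, 1.2), (-0.8, 0.8)]
labels = [1, 1, 3, 3]

means = cluster_means(rows, labels)
```

Because the inputs are standardized, a cluster mean near 0 indicates a typical county on that variable, while values near +1 or -1 indicate a cluster about one standard deviation above or below average.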
Clusters 1 and 3 are relatively well-off economically. Both have high household incomes, low poverty rates, high civilian labor forces, low unemployment rates, and high education rates. Cluster 3 is more well-off than Cluster 1, perhaps representing those with higher paying jobs. The biggest difference, though, is in COVID-19 responses. Cluster 1 has high COVID-19 infection (0.53) and death rates (0.22), while Cluster 3 has low COVID-19 infection (-0.60) and death rates (-0.75). This difference can potentially be tied back to behavioral differences: counties in Cluster 3 always mask (0.68) while counties in Cluster 1 do not (-0.88). It is important to note that this difference is not necessarily due to any sort of moral gap but more likely due to a gap in resources - it is a privilege to be able to stay informed on scientific discoveries, purchase masks, and maintain social distancing.
In contrast to Clusters 1 and 3, Cluster 2 is underprivileged, with low household income (-0.82), high poverty (0.85), low civilian labor force (-0.82), and a high percent with less than a high school degree (0.76). Cluster 2 is hit the hardest by COVID-19, with the highest death rate (0.42). Here we most clearly see the connection between underlying economic factors and the impact of COVID-19. Members of these communities tend to have jobs as essential workers that require work outside of the home. They may have to take public transit. They may be unable to afford grocery delivery services. Affluent communities have the resources to avoid COVID-19 transmission, while impoverished communities may not. Residents of these communities may also be more likely to have preexisting conditions that worsen the effects of COVID-19 and may lack quality healthcare or health insurance. We also note a potential gap in testing in these communities - death rates are high, but the reported infection rate is close to average (-0.01). As we continue to distribute COVID-19 vaccinations, special attention should be placed on supporting these communities most at risk for severe negative consequences associated with COVID-19.
This report examined connections between the economic, education, behavioral, and population data for the 3141 U.S. counties and COVID-19 infection and death rate data.
Using PCA, we reduced the dimensionality of our large dataset to 3 principal components that explain 73% of the total variability. We highlight two main factors that are associated with risk of COVID-19 infection and death: 1) masking behaviors and 2) socioeconomic status. Masking and wealth are associated with lower COVID-19 infection and death rates. Using cluster analysis, we determined which counties are most similar to each other: there are very well-off counties with high mask compliance and low COVID-19 rates, moderately well-off counties with low mask compliance and high COVID-19 rates, and impoverished counties with high COVID-19 rates. This clustering suggests that differences in masking behaviors and COVID-19 infection may not be due to any sort of moral gap but more likely due to a gap in resources - it is a privilege to be able to stay informed on scientific discoveries, purchase masks, work from home, and maintain social distancing. We also note that impoverished counties have much higher COVID-19 death rates; their residents may have preexisting conditions that worsen the effects of COVID-19 and may lack quality healthcare or health insurance.
We can observe these connections, but we cannot make any cause-and-effect statements based on our current observational study. However, even without knowing the cause, we can say that vaccine and education efforts should be prioritized in underprivileged communities with lower masking rates - these communities are being hit the hardest by COVID-19.
We hope that studies of COVID-19 death and infection rates will continue, even as vaccination rates increase, so we can find the communities that can benefit from public health efforts. We also hope that these public efforts extend beyond just COVID-19 assistance; our work has highlighted the connection between socioeconomic factors and infection and death rates. While we are unable to examine the causal nature of this relationship with this data set, hopefully future studies will probe why this connection exists and present solutions.
We note that COVID-19 is a pandemic, impacting the entire world. Though we only studied counties in the United States, it would be worthwhile to study other countries to understand how to prioritize vaccination efforts not only in the U.S. but around the world. Vaccination is a worldwide effort, and none of us are protected until we are all protected.